More on Methods for Feature Selection

When building a predictive model, you will likely be interested not only in how well the model predicts, but also in which features/predictors are important. Eliminating useless predictors simplifies your model and might even improve it by reducing noise.

The first, and simplest, way to go about this is to eliminate predictors whose variance falls below some threshold. This isn't very robust and isn't pursued further here, but a rough sketch follows for completeness.
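
A minimal sketch of how this could look with scikit-learn's VarianceThreshold (the 0.1 cutoff is arbitrary, and X stands for a numeric feature matrix like the one generated below):

from sklearn.feature_selection import VarianceThreshold

# drop features whose variance falls below an (arbitrary) threshold
selector = VarianceThreshold(threshold=0.1)
X_reduced = selector.fit_transform(X)
selector.get_support(indices=True)  # indices of the features that were kept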


In [1]:
from IPython.core.pylabtools import figsize
import numpy as np
from sklearn.datasets import make_regression
from statsmodels.api import OLS
import matplotlib.pyplot as plt
from sklearn.linear_model import lasso_path, LassoCV
from sklearn.ensemble import RandomForestRegressor as RF
from sklearn.feature_selection import f_regression

%matplotlib inline
plt.style.use('bmh')

In [2]:
X, y = make_regression(n_samples=1000, n_features=50, n_informative=3, tail_strength=1, random_state=0)

The next (and the traditional) approach: make the requisite assumptions, then do inference based on $t$-statistics and $p$-values. To do this properly, we should follow up with forward or backward stepwise regression, ideally testing every possible subset of predictors. This is more robust than simply eliminating predictors with low variance, but it can be impractical when the number of predictors is large. A rough sketch of backward elimination follows the OLS summary below.


In [3]:
model = OLS(y, X)
results = model.fit()
print(results.summary())


                            OLS Regression Results                            
==============================================================================
Dep. Variable:                      y   R-squared:                       1.000
Model:                            OLS   Adj. R-squared:                  1.000
Method:                 Least Squares   F-statistic:                 5.400e+30
Date:                Tue, 26 Jul 2016   Prob (F-statistic):               0.00
Time:                        20:10:24   Log-Likelihood:                 27905.
No. Observations:                1000   AIC:                        -5.571e+04
Df Residuals:                     950   BIC:                        -5.547e+04
Df Model:                          50                                         
Covariance Type:            nonrobust                                         
==============================================================================
                 coef    std err          t      P>|t|      [95.0% Conf. Int.]
------------------------------------------------------------------------------
x1         -5.329e-15   6.28e-15     -0.849      0.396     -1.77e-14  6.99e-15
x2         -1.688e-14   6.19e-15     -2.724      0.007      -2.9e-14 -4.72e-15
x3          2.298e-14   6.22e-15      3.695      0.000      1.08e-14  3.52e-14
x4          4.663e-14   6.02e-15      7.742      0.000      3.48e-14  5.84e-14
x5         -2.087e-14   6.11e-15     -3.414      0.001     -3.29e-14 -8.88e-15
x6                  0   6.24e-15          0      1.000     -1.23e-14  1.23e-14
x7          2.376e-14   6.36e-15      3.735      0.000      1.13e-14  3.62e-14
x8         -2.442e-14   6.15e-15     -3.971      0.000     -3.65e-14 -1.24e-14
x9          3.658e-14    6.2e-15      5.899      0.000      2.44e-14  4.88e-14
x10        -3.553e-15   6.23e-15     -0.570      0.569     -1.58e-14  8.68e-15
x11        -4.552e-14   6.15e-15     -7.407      0.000     -5.76e-14 -3.35e-14
x12         2.776e-14   6.16e-15      4.506      0.000      1.57e-14  3.98e-14
x13        -2.665e-14   6.16e-15     -4.322      0.000     -3.87e-14 -1.45e-14
x14         2.953e-14   5.87e-15      5.027      0.000       1.8e-14  4.11e-14
x15        -5.718e-15   6.07e-15     -0.941      0.347     -1.76e-14   6.2e-15
x16        -2.759e-14   6.06e-15     -4.550      0.000     -3.95e-14 -1.57e-14
x17        -1.177e-14    5.9e-15     -1.994      0.046     -2.33e-14 -1.87e-16
x18        -3.941e-15   6.28e-15     -0.627      0.531     -1.63e-14  8.39e-15
x19           50.9414   6.19e-15   8.23e+15      0.000        50.941    50.941
x20        -2.409e-14   6.21e-15     -3.877      0.000     -3.63e-14 -1.19e-14
x21         5.862e-14   6.13e-15      9.569      0.000      4.66e-14  7.06e-14
x22        -5.773e-14   6.03e-15     -9.580      0.000     -6.96e-14 -4.59e-14
x23        -4.147e-14   6.17e-15     -6.722      0.000     -5.36e-14 -2.94e-14
x24         5.773e-15   5.92e-15      0.975      0.330     -5.85e-15  1.74e-14
x25          9.77e-15   6.02e-15      1.624      0.105     -2.04e-15  2.16e-14
x26           74.2325   6.06e-15   1.23e+16      0.000        74.232    74.232
x27         5.107e-15   6.13e-15      0.833      0.405     -6.92e-15  1.71e-14
x28        -1.865e-14   6.17e-15     -3.022      0.003     -3.08e-14 -6.54e-15
x29         1.976e-14   6.24e-15      3.166      0.002      7.51e-15   3.2e-14
x30        -1.987e-14   5.97e-15     -3.329      0.001     -3.16e-14 -8.16e-15
x31         1.177e-14   6.04e-15      1.948      0.052     -8.68e-17  2.36e-14
x32         1.421e-14   6.11e-15      2.327      0.020      2.23e-15  2.62e-14
x33         5.329e-14   6.04e-15      8.818      0.000      4.14e-14  6.52e-14
x34        -1.332e-14   6.09e-15     -2.189      0.029     -2.53e-14 -1.38e-15
x35        -1.388e-14   6.11e-15     -2.271      0.023     -2.59e-14 -1.89e-15
x36           41.7912   5.96e-15   7.02e+15      0.000        41.791    41.791
x37        -1.177e-14   6.37e-15     -1.849      0.065     -2.43e-14  7.25e-16
x38        -2.576e-14      6e-15     -4.295      0.000     -3.75e-14  -1.4e-14
x39        -1.266e-14    6.3e-15     -2.010      0.045      -2.5e-14 -2.97e-16
x40         7.841e-15   6.05e-15      1.297      0.195     -4.02e-15  1.97e-14
x41        -3.902e-14   6.06e-15     -6.443      0.000     -5.09e-14 -2.71e-14
x42        -2.576e-14    6.2e-15     -4.157      0.000     -3.79e-14 -1.36e-14
x43        -9.437e-15   6.28e-15     -1.503      0.133     -2.18e-14  2.88e-15
x44        -8.882e-16   6.47e-15     -0.137      0.891     -1.36e-14  1.18e-14
x45        -2.509e-14   6.23e-15     -4.026      0.000     -3.73e-14 -1.29e-14
x46        -1.121e-14   6.24e-15     -1.797      0.073     -2.35e-14  1.04e-15
x47        -2.487e-14   6.14e-15     -4.048      0.000     -3.69e-14 -1.28e-14
x48         -1.11e-14   6.35e-15     -1.748      0.081     -2.36e-14  1.36e-15
x49        -2.354e-14   6.32e-15     -3.724      0.000     -3.59e-14 -1.11e-14
x50         3.286e-14   6.01e-15      5.472      0.000      2.11e-14  4.46e-14
==============================================================================
Omnibus:                        1.924   Durbin-Watson:                   1.935
Prob(Omnibus):                  0.382   Jarque-Bera (JB):                1.883
Skew:                          -0.038   Prob(JB):                        0.390
Kurtosis:                       3.198   Cond. No.                         1.56
==============================================================================

Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
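
As a follow-up, here is a rough sketch of backward elimination driven by the $p$-values above (the function name and the 0.05 cutoff are my own choices; real stepwise procedures often use criteria such as AIC/BIC instead):

# sketch: repeatedly drop the least significant predictor until every
# remaining p-value falls below the (arbitrary) threshold
def backward_eliminate(X, y, p_threshold=0.05):
    cols = list(range(X.shape[1]))
    while cols:
        pvalues = OLS(y, X[:, cols]).fit().pvalues
        worst = int(np.argmax(pvalues))
        if pvalues[worst] <= p_threshold:
            break
        cols.pop(worst)
    return cols  # indices of the retained predictors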

Scikit-learn implements an $F$-test to assess the significance of each predictor considered individually. The null hypothesis is that a model built on that predictor alone is no better than a null model, and the alternative is that it is better. This can be combined with utilities such as SelectKBest to quickly select features (a brief sketch appears after the output below); see http://scikit-learn.org/stable/modules/feature_selection.html. However, on its own it doesn't allow for basic inferences such as confidence intervals for those predictors.


In [4]:
scores = f_regression(X, y)

In [5]:
print(scores[0])  # F-statistics
print(np.argmin(scores[1]))  # index of smallest p-value


[  9.19806063e-02   8.84142196e-02   9.46082352e-01   1.08791367e-01
   1.18435102e-01   1.78828783e-01   4.80483108e-01   4.26620241e-01
   4.26903127e-01   1.12012505e+00   8.28503181e-01   1.14236934e+00
   1.85877560e+00   4.10908060e-01   2.01047641e+00   4.70509086e-01
   7.30862228e-01   4.47361769e-01   3.03368979e+02   1.90388315e+00
   1.87794687e+00   1.22875077e-01   2.95773411e-02   1.05895233e+00
   1.03917772e+00   1.17183957e+03   8.61343220e-03   1.12153011e-01
   1.38546446e+00   1.66019211e-01   5.56303351e-01   9.69038923e-01
   1.29548390e+00   7.89639925e-01   3.93651926e-01   2.43124121e+02
   4.26807712e-02   6.79000846e-01   6.70821402e-01   7.79284804e-03
   8.32940504e-02   2.47618203e-02   8.37188990e-02   1.81939541e-01
   4.04250578e+00   7.40928376e-01   1.23268442e+00   1.24342244e-01
   1.08198454e-01   4.62288218e+00]
25

In [6]:
np.argpartition(scores[1], 3)[:3]  # indices of the 3 smallest p-values, i.e. the top 3 features


Out[6]:
array([25, 18, 35])
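
As mentioned above, the same selection can be wrapped in SelectKBest; a minimal sketch (k=3 only because we know this dataset was generated with three informative features):

from sklearn.feature_selection import SelectKBest

# wrap the univariate F-test in a transformer that keeps the k highest-scoring features
selector = SelectKBest(f_regression, k=3)
X_top3 = selector.fit_transform(X, y)
selector.get_support(indices=True)  # indices of the selected features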

I generally prefer $\ell_1$ regularization, which performs a 'continuous' version of stepwise variable selection, to traditional stepwise selection or sklearn's $F$-test. It needs cross-validation to tune effectively. The downside is that it often misses highly correlated features: typically it will latch onto one and drop the others. That likely won't affect the predictive quality of the final model, but it may affect the stability of the model when refit to different data. A quick demonstration of this behavior follows the lasso path plot below.


In [7]:
# lasso = l1 regularized linear regression
lasso_model = LassoCV(n_jobs=-1, random_state=0, normalize=True, verbose=1, copy_X=True)
lasso_model.fit(X, y)

lasso_model.coef_


............................................................................................................................................................................................................................................................................................................[Parallel(n_jobs=-1)]: Done   3 out of   3 | elapsed:    0.1s finished
Out[7]:
array([ -0.        ,  -0.        ,  -0.        ,  -0.        ,
         0.        ,  -0.        ,  -0.        ,  -0.        ,
        -0.        ,   0.        ,  -0.        ,   0.        ,
        -0.        ,  -0.        ,  -0.        ,  -0.        ,
        -0.        ,   0.        ,  50.86568652,  -0.        ,
         0.        ,  -0.        ,  -0.        ,  -0.        ,
         0.        ,  74.1571263 ,   0.        ,   0.        ,
        -0.        ,  -0.        ,   0.        ,   0.        ,
        -0.        ,  -0.        ,   0.        ,  41.72164592,
         0.        ,  -0.        ,   0.        ,   0.        ,
        -0.        ,   0.        ,   0.        ,   0.        ,
        -0.        ,  -0.        ,  -0.        ,  -0.        ,
        -0.        ,   0.        ])

In [8]:
[i for i, c in enumerate(lasso_model.coef_) if not np.isclose(c, 0.0)]  # indices of the nonzero coefficients


Out[8]:
[18, 25, 35]

In [9]:
eps = 5e-5  # alpha_min / alpha_max; smaller = longer path
alphas_lasso, coefs_lasso, _ = lasso_path(X, y, eps=eps)

figsize(12, 6)
plt.plot(np.log(alphas_lasso), coefs_lasso.T)
plt.xlabel('log(alpha)')  # label matches what is plotted
plt.ylabel('coefficients');
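
To see the correlated-feature behavior mentioned above, here is a quick sketch (duplicating column 18 is my own choice for illustration): append an exact copy of an informative column and refit. The lasso will typically put essentially all of the weight on one copy and zero out the other, and which copy gets the weight is effectively arbitrary.

# sketch: append an exact copy of column 18 and refit the cross-validated lasso
X_dup = np.hstack([X, X[:, [18]]])
dup_model = LassoCV(random_state=0, normalize=True).fit(X_dup, y)
dup_model.coef_[18], dup_model.coef_[-1]  # typically one is (near) zero and the other carries the weight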


Another approach (and another of my favorites): use random forest feature importances. This requires fitting the random forest model in full, which can be computationally expensive.


In [10]:
rf_model = RF(n_jobs=-1, verbose=1, n_estimators=500)
rf_model.fit(X, y)

importance_scores = rf_model.feature_importances_
np.argpartition(importance_scores, -3)[-3:]  # indices of the 3 largest importances, i.e. the top 3 features


[Parallel(n_jobs=-1)]: Done  42 tasks      | elapsed:    0.6s
[Parallel(n_jobs=-1)]: Done 192 tasks      | elapsed:    2.2s
[Parallel(n_jobs=-1)]: Done 442 tasks      | elapsed:    4.8s
[Parallel(n_jobs=-1)]: Done 500 out of 500 | elapsed:    5.4s finished
Out[10]:
array([35, 25, 18])

There are many other methods. For a detailed list of options available in scikit-learn, look here: http://scikit-learn.org/stable/modules/feature_selection.html.


In [11]:
plt.scatter(importance_scores, range(len(importance_scores)))
plt.xlabel('Importance')
plt.ylabel('Variable')
plt.title('Random Forest Variable Importances');


